🏆 Data Science I - ML Competition: Predict Academic Performance
Welcome to the Machine Learning Kaggle Competition!
🎯 Objective
Use the provided dataset to predict whether a student earns a B or higher (the binary B_or_Higher column) from features such as attendance, study hours, stress level, and sleep.
🧠 Your Task
Train and tune multiple machine learning models and aim for the best F1 Score. For the competition, you will:
- Implement 2 of the following: Random Forest, Support Vector Machine, Neural Network
- Include hyperparameter tuning for improved predictive accuracy
- Conclude with evaluation of the two selected models based on metrics shown in the classification report
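The workflow in the task list can be sketched roughly as follows. This is a minimal sketch, not the competition solution: it uses a synthetic dataset from make_classification in place of the real CSV, picks Random Forest and SVM as the two models, and the parameter grids are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and the B_or_Higher target (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two of the three allowed model families, each with a small placeholder grid.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "svm": (
        # SVMs are sensitive to feature scale, so scale inside the pipeline.
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
}

test_f1 = {}
for name, (model, grid) in candidates.items():
    # Tune on the training split only, optimizing the competition metric (F1).
    search = GridSearchCV(model, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    test_f1[name] = f1_score(y_test, y_pred)
    print(name, search.best_params_)
    print(classification_report(y_test, y_pred))
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, which avoids leaking held-out statistics into the tuning step.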
🧾 Dataset
- Filename: Students_Grading_Dataset.csv
- Target Column: B_or_Higher
Let's get started!
In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
rcParams['figure.figsize'] = (15, 8)
Section 1: Preprocessing
Question 1: Load the Academic Performance Dataset
In [46]:
academic_df = pd.read_csv('Academic_Dataset.csv')
Question 2: Show the first 5 rows
In [47]:
academic_df.head()
Out[47]:
| | Student_ID | Gender | Age | Attendance (%) | Midterm_Score | Final_Score | Assignments_Avg | Quizzes_Avg | Participation_Score | Projects_Score | B_or_Higher | Study_Hours_per_Week | Extracurricular_Activities | Internet_Access_at_Home | Parent_Education_Level | Family_Income_Level | Stress_Level (1-10) | Sleep_Hours_per_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S1000 | 0 | 22 | 52.29 | 55.03 | 57.82 | 84.22 | 74.06 | 3.99 | 85.90 | 0 | 6.2 | 0 | 1 | 1 | Medium | 5 | 4.7 |
| 1 | S1002 | 1 | 24 | 57.19 | 67.05 | 93.68 | 67.70 | 85.70 | 5.05 | 73.79 | 0 | 20.7 | 0 | 1 | 3 | Low | 6 | 6.2 |
| 2 | S1003 | 0 | 24 | 95.15 | 47.79 | 80.63 | 66.06 | 93.51 | 6.54 | 92.12 | 1 | 24.8 | 1 | 1 | 1 | High | 3 | 6.7 |
| 3 | S1004 | 0 | 23 | 54.18 | 46.59 | 78.89 | 96.85 | 83.70 | 5.97 | 68.42 | 0 | 15.4 | 1 | 1 | 1 | High | 2 | 7.1 |
| 4 | S1005 | 1 | 21 | NaN | 78.85 | 43.53 | 71.40 | 52.20 | 6.38 | 67.29 | 1 | 8.5 | 1 | 1 | 4 | High | 1 | 5.0 |
Question 3: Show the non-null counts for all columns
In [48]:
academic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3206 entries, 0 to 3205
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Student_ID                  3206 non-null   object
 1   Gender                      3206 non-null   int64
 2   Age                         3206 non-null   int64
 3   Attendance (%)              2863 non-null   float64
 4   Midterm_Score               3206 non-null   float64
 5   Final_Score                 3206 non-null   float64
 6   Assignments_Avg             2886 non-null   float64
 7   Quizzes_Avg                 3206 non-null   float64
 8   Participation_Score         3206 non-null   float64
 9   Projects_Score              3206 non-null   float64
 10  B_or_Higher                 3206 non-null   int64
 11  Study_Hours_per_Week        3206 non-null   float64
 12  Extracurricular_Activities  3206 non-null   int64
 13  Internet_Access_at_Home     3206 non-null   int64
 14  Parent_Education_Level      3206 non-null   int64
 15  Family_Income_Level         3206 non-null   object
 16  Stress_Level (1-10)         3206 non-null   int64
 17  Sleep_Hours_per_Night       3206 non-null   float64
dtypes: float64(9), int64(7), object(2)
memory usage: 451.0+ KB
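The info() output shows two incomplete columns: Attendance (%) has 3206 - 2863 = 343 missing values and Assignments_Avg has 3206 - 2886 = 320. A more direct way to surface this is to count missing values per column. The sketch below uses a tiny made-up frame (demo_df is a hypothetical stand-in for academic_df) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for academic_df (values are illustrative only).
demo_df = pd.DataFrame({
    "Attendance (%)": [52.29, 57.19, np.nan, 54.18],
    "Assignments_Avg": [84.22, np.nan, np.nan, 96.85],
    "Final_Score": [57.82, 93.68, 80.63, 78.89],
})

# Count missing values per column and express them as a percentage of rows.
missing = demo_df.isna().sum()
missing_report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(demo_df) * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing_report = missing_report[missing_report["n_missing"] > 0]
print(missing_report)
```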
Question 4: Show the min, max, median, mean for all columns
In [49]:
academic_df.describe()
Out[49]:
| | Gender | Age | Attendance (%) | Midterm_Score | Final_Score | Assignments_Avg | Quizzes_Avg | Participation_Score | Projects_Score | B_or_Higher | Study_Hours_per_Week | Extracurricular_Activities | Internet_Access_at_Home | Parent_Education_Level | Stress_Level (1-10) | Sleep_Hours_per_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3206.000000 | 3206.000000 | 2863.000000 | 3206.000000 | 3206.000000 | 2886.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 | 3206.000000 |
| mean | 0.511853 | 21.039925 | 75.391753 | 70.048958 | 69.445165 | 75.186372 | 74.787645 | 5.004766 | 74.853624 | 0.493450 | 17.658484 | 0.301622 | 0.896132 | 2.506550 | 5.459139 | 6.480162 |
| std | 0.499937 | 1.996322 | 14.295928 | 17.089972 | 17.210758 | 14.370601 | 14.601180 | 2.869489 | 14.452116 | 0.500035 | 7.276542 | 0.459034 | 0.305136 | 1.121811 | 2.855225 | 1.457031 |
| min | 0.000000 | 18.000000 | 50.010000 | 40.010000 | 40.000000 | 50.000000 | 50.030000 | 0.000000 | 50.010000 | 0.000000 | 5.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 4.000000 |
| 25% | 0.000000 | 19.000000 | 63.140000 | 55.442500 | 54.460000 | 62.632500 | 62.250000 | 2.530000 | 62.142500 | 0.000000 | 11.400000 | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 5.200000 |
| 50% | 1.000000 | 21.000000 | 75.730000 | 70.025000 | 69.295000 | 75.255000 | 74.475000 | 5.000000 | 74.930000 | 0.000000 | 17.400000 | 0.000000 | 1.000000 | 2.000000 | 5.000000 | 6.500000 |
| 75% | 1.000000 | 23.000000 | 87.170000 | 84.487500 | 84.102500 | 87.517500 | 87.657500 | 7.530000 | 87.340000 | 1.000000 | 24.100000 | 1.000000 | 1.000000 | 4.000000 | 8.000000 | 7.700000 |
| max | 1.000000 | 24.000000 | 100.000000 | 99.980000 | 99.980000 | 99.980000 | 99.960000 | 10.000000 | 100.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | 4.000000 | 10.000000 | 9.000000 |
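In describe(), the median appears as the 50% row, and non-numeric columns such as Family_Income_Level are left out by default. If only the four statistics named in the question are wanted, agg can request them explicitly. A minimal sketch, assuming a small stand-in frame (demo_df is hypothetical; its values are copied from the first five rows shown earlier):

```python
import pandas as pd

# Stand-in for academic_df, using a few values from the head() output.
demo_df = pd.DataFrame({
    "Age": [22, 24, 24, 23, 21],
    "Midterm_Score": [55.03, 67.05, 47.79, 46.59, 78.85],
    "Family_Income_Level": ["Medium", "Low", "High", "High", "High"],
})

# Restrict to numeric columns, then compute exactly the requested statistics.
summary = demo_df.select_dtypes(include="number").agg(
    ["min", "max", "median", "mean"]
)
print(summary)
```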
Question 5: Drop any columns that are not needed
In [50]:
academic_df = academic_df.drop(columns='Student_ID')
Question 6: Address missing values through Imputation or Dropping
In [51]:
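# Impute the two columns that have missing values with their column means.
# A bare academic_df.dropna() returns a new frame and leaves academic_df
# unchanged, so it is omitted here.
academic_df['Attendance (%)'] = academic_df['Attendance (%)'].fillna(academic_df['Attendance (%)'].mean())
academic_df['Assignments_Avg'] = academic_df['Assignments_Avg'].fillna(academic_df['Assignments_Avg'].mean())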
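Mean imputation on the full dataset lets rows that later end up in the test split influence the fill value. One possible alternative, sketched below under assumptions, is to compute the means on the training rows only and reuse them for the test rows. The frame demo_df, the row-based split, and the last row's values are all hypothetical and serve only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in the same two columns as the real data.
demo_df = pd.DataFrame({
    "Attendance (%)": [52.29, 57.19, 95.15, 54.18, np.nan, 80.0],
    "Assignments_Avg": [84.22, 67.70, np.nan, 96.85, 71.40, 60.0],
})

# Hypothetical split: first four rows are training data, last two are test data.
train = demo_df.iloc[:4].copy()
test = demo_df.iloc[4:].copy()

# Column means computed from the training rows only (NaNs are skipped).
fill_values = train.mean()

# fillna with a Series fills each column using the matching label.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

scikit-learn's SimpleImputer follows the same fit-on-train, transform-on-test pattern and can be placed inside a pipeline.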
Question 7: Convert the Family_Income_Level column to numeric
In [52]:
family_map = {'Low': 0, 'Medium': 1, 'High': 2}
academic_df['Family_Income_Level'] = academic_df['Family_Income_Level'].map(family_map)
academic_df.head()
Out[52]:
| | Gender | Age | Attendance (%) | Midterm_Score | Final_Score | Assignments_Avg | Quizzes_Avg | Participation_Score | Projects_Score | B_or_Higher | Study_Hours_per_Week | Extracurricular_Activities | Internet_Access_at_Home | Parent_Education_Level | Family_Income_Level | Stress_Level (1-10) | Sleep_Hours_per_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 52.290000 | 55.03 | 57.82 | 84.22 | 74.06 | 3.99 | 85.90 | 0 | 6.2 | 0 | 1 | 1 | 1 | 5 | 4.7 |
| 1 | 1 | 24 | 57.190000 | 67.05 | 93.68 | 67.70 | 85.70 | 5.05 | 73.79 | 0 | 20.7 | 0 | 1 | 3 | 0 | 6 | 6.2 |
| 2 | 0 | 24 | 95.150000 | 47.79 | 80.63 | 66.06 | 93.51 | 6.54 | 92.12 | 1 | 24.8 | 1 | 1 | 1 | 2 | 3 | 6.7 |
| 3 | 0 | 23 | 54.180000 | 46.59 | 78.89 | 96.85 | 83.70 | 5.97 | 68.42 | 0 | 15.4 | 1 | 1 | 1 | 2 | 2 | 7.1 |
| 4 | 1 | 21 | 75.391753 | 78.85 | 43.53 | 71.40 | 52.20 | 6.38 | 67.29 | 1 | 8.5 | 1 | 1 | 4 | 2 | 1 | 5.0 |
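Series.map silently turns any value that is not a key in the dictionary into NaN, so a misspelled or unexpected category would reintroduce missing values after imputation. A quick check for that is sketched below. The lowercase 'high' and the None entry are hypothetical inputs added to show the failure mode; they are not taken from the dataset.

```python
import pandas as pd

family_map = {"Low": 0, "Medium": 1, "High": 2}

# Made-up series: 'high' (wrong case) and None are deliberate problem values.
income = pd.Series(
    ["Medium", "Low", "High", "high", None], name="Family_Income_Level"
)

mapped = income.map(family_map)

# Values that were present in the input but had no entry in family_map.
unmapped = income[mapped.isna() & income.notna()].unique()
print("Unmapped categories:", list(unmapped))
print("NaNs after mapping:", int(mapped.isna().sum()))
```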
Section 2: Data Exploration
Question 1: Create a pairplot of the dataset
In [71]:
sns.pairplot(academic_df, hue='B_or_Higher', palette='husl')
plt.show()
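A pairplot over all 17 columns produces a 17 x 17 grid that is slow to render and hard to read. One option, sketched here as an assumption rather than part of the assignment, is to rank features by absolute correlation with the target and plot only the top few. The frame demo_df and its values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for academic_df with a handful of columns.
demo_df = pd.DataFrame({
    "Final_Score": [57.82, 93.68, 80.63, 78.89, 43.53, 88.0],
    "Sleep_Hours_per_Night": [4.7, 6.2, 6.7, 7.1, 5.0, 6.0],
    "Age": [22, 24, 24, 23, 21, 20],
    "B_or_Higher": [0, 0, 1, 0, 1, 1],
})
target = "B_or_Higher"

# Absolute Pearson correlation of each feature with the target, strongest first.
ranking = (
    demo_df.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
top_features = ranking.head(2).index.tolist()
print(ranking)

# The shortlist could then be passed to seaborn, for example:
# sns.pairplot(academic_df, vars=top_features, hue=target, palette='husl')
```

Pearson correlation only captures linear relationships with a binary target, so this is a rough filter for plotting, not a feature-selection method.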